LONDON DATASTORE: Recommendations, workflows and challenges from a data analyst's perspective

A deep-dive from an academic's, an analyst's and a GIS researcher's perspective, demonstrating common workflows integrating the London Datastore. The proposed recommendations and modifications for the future London Datastore are designed to enhance the Datastore as the premier choice for analysts seeking up-to-date and precise data on London.
Author

Julian Hoffmann Anton | University College London (UCL)

Published

14-12-2023

DRAFT - Executive Summary

This report presents an extensive analysis of the current London Datastore (LDS), examining common use cases and improvements to consider for the future development and expansion of the platform. It offers an in-depth exploration from three distinct viewpoints with common interests: academic research, data analysis, and Geographic Information System (GIS) research. This multifaceted approach demonstrates typical workflows for integrating data from the London Datastore into various fields of study and application. The proposed recommendations and modifications are designed to enhance the Datastore as the premier choice for analysts seeking up-to-date and precise data on London.

The first section revisits the previously set priorities for the future development of the London Data Catalogue (LDC). Following this, we introduce the variety of user tools for data analysis and mapping. We then present an analysis of the LDC's current metadata, acknowledging its complex scope and identifying areas requiring enhancement for users and for the overall potential of the platform. Then we weave in observations and recommendations from the workflows of contemporary data analysts, researchers, and academics. This journey, from the initial idea to the final output, leverages the LDC as a central resource, through various data types, access formats, and frameworks. Consequently, we have identified and summarized 15 key observations and recommendations, which are detailed in the following sections.

[to be continued]

Link to the London Datastore: https://data.london.gov.uk

DRAFT - Key Recommendations

[Provisory hierarchical order]

  1. Encourage data producers to adopt a standardized approach for data sharing formats, ensuring that dataset formats are tidy [1], consistent, and efficient.

  2. Emphasize the key geographical datasets and mapping shapes on the platform that are most commonly used for Geographic Information System (GIS) applications in London.

  3. Enrich the metadata tagging and labeling, while improving the metadata quality and consistency.

  4. Highlight the platform's API capabilities and facilitate their use by guiding the user through examples.

  5. Correct, expand and refine metadata tags to enhance their scope and inter-connectivity, fostering discovery of relevant related data.

  6. Improve the visibility and accuracy of the data’s release dates, and ensure clear indicators are present for data updates and changes.

  7. Anticipate and adapt to future data-sharing needs by supporting advanced formats and technologies like point clouds, 3D models, APIs and real-time data streams.

  8. Enhance the user experience for data exploration and navigation, allowing for more intuitive searches and the discovery of relevant new and past data.

  9. Introduce a better search engine system or large language model (LLM) chatbot assistance to explore the datasets available.

  10. Simplify the process for users to find and utilize published analyses and reports, making them more accessible.

  11. Clearly differentiate among datasets, analytical reports, and tabular data to prevent confusion and streamline user interaction.

  12. Ensure that file format information is displayed in a clear, concise, and transparent manner, aiding in user comprehension.

  13. Provide a succinct description for each dataset to convey its contents and purpose at a glance.

  14. Implement versatile sorting features, allowing users to filter datasets by relevance, publication date, and other relevant criteria.

  15. Gather, analyse and explore user metrics and downloads to stay ahead of data demand and learn from users.

Introduction

“What do data analysts want from the data they use and where they access it from? What would a bold and visionary approach to data provision (a new data store) look like?”

There is very often a disconnect between how data custodians or producers structure and provide access to their data and the needs of potential end users. In this report, we will look into different classes of users, each with their own format priorities but with many common interests. The purpose of this study is to inform the London Datastore team with practical examples of the platform's uses, and to show how improvements, scaled to the whole user base of the London Datastore, can make a big difference. As data end-users, data analysts, scientists and researchers, how would the recommendations from the report on the London Datastore's future [2] affect and help users? We will work backwards from them and their outputs to the London Datastore and its sources. We will also elaborate on the effect of implementing the Data Services Standard identified in the GLA's Data Services Investment Proposal (2021) [3] – more specifically, looking at users' needs and problems, how to facilitate access to data, and how some changes will incentivise the use of the platform.

This will lead us to mention the so-called 'Tidy Data Principles' [4] and similar frameworks for optimal dataset formats and efficient standardisation. We will be in touch with other experts on the topic at UCL who have developed workflows and efficient systems, such as the data curation software Whyqd [5]. We can draw inspiration from other cities and platforms as well. We will extend the recommendations found at the end of 'Discovering the Future of the London Data Store' with tangible examples which show how they affect the end-user's experience. This will also elaborate on the 'Data Services Investment Proposition (2021)' section on Data Service Standards from an academic's, an analyst's and a researcher's perspective: different users of the London Datastore, but with similar workflows.

1 Background of the report

1.1 Previous Research to design the future London Datastore

Our study expands on the findings of extensive research done by the GLA and the Open Data Institute. The current improvements identified in the Future of the London Datastore report are the following:

  1. Up to date data
  2. Broader datasets and types of data
  3. Metadata and different views for insights
  4. Improving navigation and search function - including expanding categories and providing better descriptions of them
  5. Offering signposts and update indicators
  6. Offering different formats – and allowing interactive visualisation and charts, but also better integration between data and analytical outputs

We will do the exercise of diving deeper into the workflows affected by the points mentioned and elaborate on their impact on analytics projects.

We will include the observations from the “Data Services Investment Proposal (2021) - Appendix B” [6] and, at the same time, expand on the proposed “Data Service Standards” concerning the user:

  1. Understand users and their needs. Look at the full context to understand what the user is trying to achieve, not just the part where they interact with the GLA. Understanding as much of the context as possible gives you the best chance of meeting users’ needs in a simple and cost effective way. In a data services context, the core “user” may be a particular type of officer in a London Borough, who has their own particular work context to understand.
  2. Solve a whole problem for users. Work towards creating a service that solves one whole problem for users, collaborating across organisational boundaries where necessary. This may mean aligning with other parts of the GLA or central government to combine data or other supporting services, or simply ensuring that the data service can connect seamlessly with related tasks, for example by ensuring that it uses a common language and that data is exportable in a helpful format.
  3. Provide a joined up experience across all channels [less relevant to data services, as they are generally not consumed e.g. offline]
  4. Make the service simple to use: make sure the service helps the user to do the thing they need to do as simply as possible - so that people succeed first time, with the minimum of help; test for usability frequently with actual and potential users, using appropriate research techniques; design the service to work with a range of devices that reflects users’ behaviour - in a data services context, most services may be delivered to officers’ laptops and desktops, but should take account of security/firewall limitations, browsers, etc. in their corporate environments.
  5. Make sure everyone can use the service: where users are officers, this will mainly mean meeting accessibility standards.

We will proceed with tangible examples from the user's perspective.

2 A modern data analysis & mapping project workflow

This is a hands-on journey from idea to publication, highlighting where the London Datastore (LDS) can support and facilitate most users' workflows. Most research pieces involving data commonly require a series of tasks, transformations and operations which can be extremely time-consuming, especially if we aggregate it across the whole user base of the London Datastore. The benefits of providing data the right way will accelerate and multiply the use and impact of the LDC on Londoners.

2.1 Tools for data and maps

Modern data users will most likely combine several tools to extract insights out of data. Here is a non-exhaustive list of the most commonly used tools (their importance varies by industry, sector and purpose):

  • Excel: A ubiquitous tool for basic data analysis, manipulation, and visualization, especially in business contexts.
  • Python: A versatile programming language with extensive libraries for data analysis (e.g., Pandas, NumPy, SciPy, Matplotlib).
  • R: A programming language and environment focused on statistical computing and graphics, widely used in academia and research. This report and analysis was created purely with R and RStudio.
  • SQL: Essential for managing and querying structured data in relational databases.
  • Tableau: A powerful tool for data visualization, enabling the creation of interactive and shareable dashboards.
  • Power BI: Microsoft’s business analytics service, providing interactive visualizations and business intelligence capabilities.
  • SAS: An integrated software suite for advanced analytics, business intelligence, data management, and predictive analytics.
  • SPSS (IBM SPSS Statistics): Widely used for statistical analysis in social science, it offers a range of analytical tools.
  • Matlab: A high-level language and interactive environment used heavily in engineering and scientific computing.
  • Stata: A tool for data manipulation, visualization, statistics, and automated reporting.

2.2 Modern Geographic Information Science tools

  • ArcGIS: A comprehensive GIS software for creating, analyzing, and managing geographic data, widely used in various industries.

  • QGIS (Quantum GIS): An open-source GIS software that supports viewing, editing, and analysis of geospatial data.

  • R: Known for its strong statistical capabilities, R has packages like sf, sp, rgdal, and rgeos for spatial data analysis and visualization (sf is now the modern standard, with rgdal and rgeos retired from CRAN in October 2023).

  • Python: A versatile programming language with GIS-focused libraries like Geopandas, Shapely, Fiona, and Pyproj, widely used for scripting and automating GIS processes.

Now that we know the tools, we have to find the right data in the right format.

2.3 Exploring data and finding data

First of all, let's establish the scenarios you as a user will encounter.

A. You know what you are looking for:

  1. You find what you want and more than you expected: Best case.

  2. You find what you are looking for and nothing else: Great case.

  3. You don’t find what you want but find something useful: Good case.

  4. You don’t find what you want and nothing else is useful: Bad case.

B. You don’t know what you are looking for:

  5. You discover something to start a new project or to influence a project: Excellent case.
  6. The exploration process and the data discovered don’t inspire you to start a project or to include them in your projects: Bad case.

Any change to the platform should aim to increase the likelihood of scenarios 1, 2, 3 and 5 and decrease that of 4 and 6. We will gather recommendations for that purpose.

2.4 Empowering the catalogue with better metadata

2.4.1 Metadata analysis of the current catalogue

2.4.1.1 Loading all the metadata of the Datastore as of 06.12.2023
Code
library(readxl)     # read_excel() for the catalogue metadata export
library(tidyverse)  # dplyr, ggplot2, tidyr and stringr are used throughout

metadata <- read_excel("Data/All meta data - export -2023.12.06.xlsx")
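
As a quick sanity check after loading, it helps to inspect the structure of the export. The sketch below simply previews the columns; the names referenced later in this report (title, createdAt, updatedAt, update_frequency, publisher, author and the topic-* flags) are assumed to be present in the Excel export.

Code
# Preview the catalogue metadata: column names, types and first values;
# later sections rely on title, createdAt, updatedAt, update_frequency,
# publisher, author and the topic-* flag columns
glimpse(metadata)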

As researchers and analysts land on a data platform these common questions arise:

  • Is this the latest data available? How much of the data is being updated?

  • Are old datasets being refreshed?

  • Are many new datasets appearing regularly?

  • Are these datasets one-off publications?

The graphs below provide answers and give an overview of the update rates and status.

[analysis to be continued]

Code
# Reference date used to mark "recent" activity on the timeline
vertical_line_datetime <- as.POSIXct("2020-01-01", tz = "UTC")

# Helper: keep only every n-th axis label to avoid overplotting
every_nth <- function(n) {
  function(x) x[c(TRUE, rep(FALSE, n - 1))]
}

# One row per dataset: a grey segment runs from its creation date (red dot)
# to its last update (green dot); after coord_flip(), geom_hline() draws
# the vertical reference line
ggplot(metadata %>%
         head(1200) %>%
         arrange(desc(updatedAt)),
       aes(x = reorder(title, updatedAt))) +
  geom_segment(aes(y = createdAt, yend = updatedAt, xend = title), color = "grey") +
  geom_point(aes(y = updatedAt), color = rgb(0.2, 0.7, 0.1, 0.5), size = 1) +
  geom_point(aes(y = createdAt), color = rgb(0.7, 0.2, 0.1, 0.5), size = 1) +
  coord_flip() +
  geom_hline(yintercept = vertical_line_datetime, col = "red") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.y = element_text(size = rel(0.5))) +
  scale_x_discrete(breaks = every_nth(n = 10)) +
  xlab("Title (every 10th of 1168)") +
  ylab("Added date and update date") +
  labs(title = "Dataset date and update",
       subtitle = "As of 2023.12.06 there are 1166 datasets")

2.5 Update frequency

[analysis to be continued]

Code
# Count datasets per declared update frequency
chart_df <- metadata %>%
  select(update_frequency) %>%
  group_by(update_frequency) %>%
  summarise(count = n()) %>%
  mutate(total = sum(count))

ggplot(data = chart_df, aes(x = count, y = reorder(update_frequency, count))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = paste0(" ", count, " - ", round(count / total, 2) * 100, "%")), hjust = 0) +
  theme_minimal() +
  labs(title = "The distribution of the update frequency available for 1166 datasets",
       x = "Number of datasets",
       y = "") +
  scale_x_continuous(limits = c(0, max(chart_df$count) * 1.15))  # expand the limit 15% beyond the max count

2.6 Data Publishers

The “Greater London Authority (GLA)” label is the largest publisher with 476 datasets, equal to 41%, followed by the ONS with 13%. 8% of datasets are from publishers which only published one dataset. This representation also shows that the GLA in fact has other subcategories of publishers, such as “GLA Opinion Research” (4%), which combined represent more than three quarters of the datasets.

Code
# Group publishers with a single publication under one label and flag GLA-related labels
chart_df <- metadata %>%
  select(publisher) %>%
  group_by(publisher) %>%
  summarise(count = n()) %>%
  mutate(publisher_2 = ifelse(count == 1, "***Single publication only", as.character(publisher))) %>%
  group_by(publisher_2) %>%
  summarise(summ = sum(count)) %>%
  mutate(total = sum(summ)) %>%
  mutate(GLA = grepl("GLA|Greater London Authority", publisher_2))

ggplot(data = chart_df, aes(x = summ, y = reorder(publisher_2, summ))) +
  geom_bar(aes(fill = GLA), stat = "identity") +
  geom_text(aes(label = paste0(" ", summ, " - ", round(summ / total, 2) * 100, "%")), hjust = 0) +
  theme_minimal() +
  labs(title = "Publishers of the 1166 datasets",
       x = "Number of datasets",
       y = "") +
  scale_fill_manual(values = c("Steelblue", "Darkred")) +
  scale_x_continuous(limits = c(0, max(chart_df$summ) * 1.15))  # expand the limit 15% beyond the max count

2.7 Data Authors

Most authors only publish one or two datasets in the LDS, together accounting for 22% of the catalogue. The label "NA" appears for 14% of datasets, meaning the author is missing.

We also find the Greater London Authority as the most prolific author of datasets, but with inconsistent labeling and sub-GLA groups:

Greater London Authority 13%, GLA 5%, Greater London Authority (GLA) 1%, GLA Economics 3%, etc. Other examples of inconsistent labels happen with Transport for London (TfL) or Census data. Correcting these is possible and would make the catalogue clearer, more efficient and more transparent.
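
A minimal sketch of such a correction, reusing the GLA matching pattern from the chart code below (the canonical label chosen here is only an assumption):

Code
# Collapse known GLA spelling variants into one canonical author label
metadata_harmonised <- metadata %>%
  mutate(author_clean = ifelse(grepl("GLA|Greater London Authority", author),
                               "Greater London Authority (GLA)",
                               as.character(author)))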

Code
# Group authors with only one or two publications under a single label
chart_df <- metadata %>%
  select(author) %>%
  group_by(author) %>%
  summarise(count = n()) %>%
  mutate(publisher_2 = ifelse(count <= 2, "***1 or 2 publications", as.character(author))) %>%
  group_by(publisher_2) %>%
  summarise(summ = sum(count)) %>%
  mutate(total = sum(summ)) %>%
  mutate(GLA = grepl("GLA|Greater London Authority", publisher_2))

ggplot(data = chart_df, aes(x = summ, y = reorder(publisher_2, summ))) +
  geom_bar(aes(fill = GLA), stat = "identity") +
  geom_text(aes(label = paste0(" ", summ, " - ", round(summ / total, 2) * 100, "%")), hjust = 0) +
  theme_minimal() +
  labs(title = "Authors of the 1166 datasets",
       x = "Number of datasets",
       y = "") +
  scale_x_continuous(limits = c(0, max(chart_df$summ) * 1.15)) +  # expand the limit 15% beyond the max count
  scale_fill_manual(values = c("Steelblue", "Darkred"))

2.7.0.1 Metadata interconnection

There are currently 18 main tag categories, with many sub-tags. Here is a visualisation of the co-occurrence of labels across all 1168 datasets.

The diagonal of this table shows the total number of datasets in that category, the largest being “Demographics” with 217 datasets. The other tiles show the co-occurrence of that label with the horizontal topic label (or vertical, if you prefer).

Code
# 'co_occurrence_matrix_18' (an 18x18 matrix of topic co-occurrence counts)
# is assumed to have been computed in an earlier step from the topic flags

# Convert the matrix to a data frame in long format
matrix_long <- as.data.frame(co_occurrence_matrix_18) %>%
  rownames_to_column(var = "row") %>%
  pivot_longer(cols = -row, names_to = "column", values_to = "value") %>%
  mutate(row = str_replace_all(row, "topic-", ""),
         column = str_replace_all(column, "topic-", ""))

# Order the topics by the number of datasets carrying each topic flag
order_vector <- metadata %>%
  select(id, starts_with("topic")) %>%
  pivot_longer(cols = starts_with("topic"), names_to = "topic", values_to = "is") %>%
  filter(is == "TRUE") %>%
  group_by(topic) %>%
  summarise(n = n()) %>%
  mutate(topic = str_replace_all(topic, "topic-", "")) %>%
  arrange(desc(n))

order_vector_1 <- as.character(order_vector$topic)

# Convert 'row' and 'column' to factors with levels given by the topic order
matrix_long$row <- factor(matrix_long$row, levels = order_vector_1)
matrix_long$column <- factor(matrix_long$column, levels = order_vector_1)

ggplot(matrix_long, aes(x = column, y = row, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), vjust = 1) +
  scale_fill_distiller(palette = "Spectral") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  coord_fixed() +
  labs(title = "Matrix of co-occurrence of topics")

2.7.0.2 A Network analysis approach

The labels and tags can be represented as a network, to visualise and calculate centrality measures and let patterns emerge. Each major category is a node, and an edge links two categories when a dataset carries both labels. The larger the node, the more datasets it has, and the larger the edge, the more connected the two nodes are.

[analysis to be continued]

2.7.0.3 Granular Tag Analysis

There are many more sub-tags beyond the main 18 categories. Without correcting for typos and different spellings, there are 1551 different tags. This re-emphasises the need for a new systematic tagging system: a richer metadata ecosystem would support data exploration and an easier search of the catalogue.

The vast majority of tags appear fewer than five times. More tagging could help link datasets and create clusters beyond the 18 main categories. Search engines and large language model (LLM) assistants would benefit from it and better help users.
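
As a rough sketch of how the tag distribution could be audited (assuming, hypothetically, that the export holds each dataset's tags as one comma-separated string column named tags; the real export format may differ):

Code
# Hypothetical 'tags' column: split, trim and count tag occurrences
tag_counts <- metadata %>%
  separate_rows(tags, sep = ",") %>%
  mutate(tags = str_trim(tags)) %>%
  count(tags, sort = TRUE)

# Share of tags appearing fewer than 5 times
mean(tag_counts$n < 5)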

2.8 Data access and formats

Encountered issues:

  • There isn’t a clear distinction between raw tabular data and reports/publications in PDF: https://data.london.gov.uk/dataset/2021-census-first-release

  • Hundreds of PDFs in one place, without a summary or short description: https://data.london.gov.uk/dataset/aberfeldy-estate-consultation-documents-16-november—16-december

  • Inconvenient formats: column names are in a separate file from the data: https://data.london.gov.uk/dataset/london-fire-brigade-incident-records

  • Outdated shapes for geographic boundaries and files. For London, we had to go directly to the ONS for the latest geographic boundaries, something required for most GIS and mapping work in London. Latest CENSUS 2021 geographies: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies

  • The flow of exploring datasets is not optimal: long scrolling is necessary, without sorting options by date or other relevant dimensions. The filtering options could be improved as well.

  • Outdated data links. Crowd-source the update-control process by adding a way for users to signal the London Datastore team when a link is broken or a new link or data source is available. Examples of broken links: TfL live bus arrivals API (https://data.london.gov.uk/dataset/tfl-live-bus-arrivals-api), TfL journey planner API (https://data.london.gov.uk/dataset/journey-planner-api-beta), TfL open data feeds (https://tfl.gov.uk/info-for/open-data-users/our-feeds).

  • Include and highlight the date of the dataset more prominently, or there is a risk it may be outdated or have a currently broken link.

  • There aren’t any links to LiDAR datasets, which are public 3D datasets. The store could potentially link to the governmental open-access LiDAR source (National LiDAR Programme): https://www.data.gov.uk/dataset/f0db0249-f17b-4036-9e65-309148c97ce4/national-lidar-programme

3 APIs and the future of real-time data

The API pipeline is not easily visible on the store and could be put forward as a key asset of the platform for developers and advanced data science users.

API documentation is present, but there aren’t any tangible examples to facilitate its use. Users already have to be advanced to make use of it, or must spend a significant amount of time discovering the possibilities.

Guidance and examples with a more prominent presence on the platform may attract more users.

[MORE ON REAL-TIME USE AND FUTURE OF API]

[ANALYSIS TO BE CONTINUED]

3.1 Accessing the metadata through the API

Code
library(httr)
library(jsonlite)

# The API endpoint for the dataset you want to access
url <- "https://data.london.gov.uk/api/dataset/number-bicycle-hires"

# Request the dataset's metadata and parse the JSON response
response <- GET(url)
dataset_meta <- fromJSON(content(response, as = "text", encoding = "UTF-8"))

3.2 Real-time present and future

[to be added]

3.3 Tidy data principles

A huge amount of effort is spent cleaning data to get it ready for analysis; this is a well-known burden in the data science world. Hadley Wickham developed a framework that makes it easy to tidy messy datasets. The definition is simple: each variable is a column, each observation is a row, and each type of observational unit is a table. Tidy datasets are easy to manipulate, model and visualize. We will show the advantages of a consistent data structure with case studies from the London Datastore, and how everyone could benefit by being freed from mundane and long data-manipulation chores.

Tidy Data: Hadley Wickham synthesized and created this efficient framework for data science and data architecture, published in a now-acclaimed Journal of Statistical Software paper [7]; it is best practice among data scientists and researchers.
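
To make the principle concrete, here is a minimal, hypothetical example: population counts with years spread across column headers violate the "each variable is a column" rule, and a single pivot restores the tidy structure. The figures are illustrative only.

Code
# Hypothetical messy table: the 'year' variable is spread across column headers
library(tidyr)

messy <- data.frame(
  borough = c("Camden", "Hackney"),
  `2021`  = c(210000, 259000),
  `2022`  = c(212000, 261000),
  check.names = FALSE
)

# Tidy: each variable (borough, year, population) is a column,
# each observation is a row
tidy <- pivot_longer(messy, cols = -borough,
                     names_to = "year", values_to = "population")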

More on data curation software: Whyqd [to be added]

3.4 Loading Geographic Data

Geographic data is a major pillar of the London Datastore, due to the fact that most of its datasets contain spatial elements in the form of coordinates, administrative boundaries or other geographic boundaries. Making access to the most commonly used shapes easy, simple and frictionless could significantly improve the efficiency and use of the platform.

Another complementary option would be cataloguing or creating a list of links to the latest relevant GIS resources for London, such as Ordnance Survey (OS) [8], Office for National Statistics (ONS) [9] or CENSUS geographies (ONS) [10].
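
As an illustrative sketch of the typical loading step (the local file path is hypothetical; the boundary file would come from one of the sources above):

Code
library(sf)

# Read a boundary file downloaded from e.g. the ONS Open Geography Portal
# (hypothetical local path)
boroughs_sf <- st_read("Data/london_boroughs.geojson")

st_crs(boroughs_sf)  # UK boundaries typically ship in British National Grid (EPSG:27700)

# Reproject to WGS84 (EPSG:4326) for web-mapping tools such as leaflet
boroughs_wgs84 <- st_transform(boroughs_sf, 4326)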

[analysis to be continued]

3.5 Exploratory Analysis and data wrangling

3.5.1 Filter Data

3.5.2 Sub-setting, selecting and filtering the data.

Research on London often requires looking at national- or regional-level datasets and then narrowing them down to the 32 boroughs and the City. Any national dataset that is pre-subset and filtered to the Greater London Area and stored in the LDC would be a great asset and support for analytics.

Code
# Filter for MSOAs in London based on the known London boroughs
# (kept commented out: 'london_boundaries_sf' is loaded in an earlier, omitted step)
# london_boroughs <- c("City of London", "Camden", "Greenwich", "Hackney", "Hammersmith and Fulham",
#                      "Islington", "Kensington and Chelsea", "Lambeth", "Lewisham", "Southwark",
#                      "Tower Hamlets", "Wandsworth", "Westminster", "Barking and Dagenham", "Barnet",
#                      "Bexley", "Brent", "Bromley", "Croydon", "Ealing", "Enfield", "Haringey",
#                      "Harrow", "Havering", "Hillingdon", "Hounslow", "Kingston upon Thames",
#                      "Merton", "Newham", "Redbridge", "Richmond upon Thames", "Sutton", "Waltham Forest")
#
# # Combine them into a single regular expression
# london_filter <- paste(london_boroughs, collapse = "|")
#
# # Use this pattern with str_detect in a filter
# london_sf <- london_boundaries_sf %>%
#   as_tibble() %>%
#   filter(str_detect(MSOA21NM, london_filter)) %>%
#   st_as_sf()

3.5.3 Data Transformation (splitting, aggregating, summarising etc.)

[analysis to be continued]

3.5.4 Data and merging

[analysis to be continued and examples to be added]

3.5.5 Data Filtering and sorting

[analysis to be continued and examples to be added]

3.5.6 Data correcting / cleaning

[analysis to be continued and examples to be added]

3.5.7 Gap filling / modelling

[analysis to be continued and examples to be added]

3.5.8 Data formatting and pivoting (wide to long / long to wide)

[analysis to be continued and examples to be added] Pivoting between wide and long formats is one of the most common data operations at the start of any analysis in Excel, R, or Python; a short sketch follows, complementing the pivot_longer example in section 3.3.
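
Code
# A sketch of the long-to-wide operation with tidyr (hypothetical data);
# pivot_longer() reverses it, as shown in section 3.3
library(tidyr)

long <- data.frame(
  borough = c("Camden", "Camden", "Hackney", "Hackney"),
  year    = c(2021, 2022, 2021, 2022),
  value   = c(10, 12, 20, 22)
)

# Long to wide: one column per year
wide <- pivot_wider(long, names_from = year, values_from = value)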

3.5.9 Geocoding

[analysis to be continued and examples to be added]

When coordinates are missing or geographic shapes are absent, GIS analysts and researchers need to apply geocoding, which in spatial data science refers to the process of converting addresses or other geographic descriptors into numerical coordinates on the Earth’s surface. Typically, this involves transforming a description like a street address, city name, or postal code into a precise latitude and longitude. These coordinates can then be used for various purposes such as mapping, spatial analysis, and geographic data visualization. [to be continued]
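
A minimal sketch with the tidygeocoder R package, using the free OpenStreetMap Nominatim service (the addresses are illustrative):

Code
library(tidygeocoder)

# Two illustrative London locations to geocode
addresses <- data.frame(addr = c("City Hall, Kamal Chunchie Way, London",
                                 "Buckingham Palace, London SW1A 1AA"))

# Resolve each address to latitude/longitude via OSM Nominatim
geocoded <- geocode(addresses, address = addr, method = "osm")
# 'geocoded' gains 'lat' and 'long' columns with the coordinates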

3.5.10 Storing / retrieving / updating

[analysis to be continued and examples to be added]

3.5.11 Data persistence and linking to historical data and updates etc.

[analysis to be continued and examples to be added]

3.5.12 Example: London static map

Link to geo file: https://data.london.gov.uk/dataset/ultra_low_emissions_zone

[analysis to be continued and examples to be added]

Code
# Read the ULEZ expansion boundaries from the GIS geopackage
london_ULEZ_sf <- st_read("Data/London_wide_ULEZ_expansion.gpkg") %>% st_as_sf()
Reading layer `ULEZ' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\London_wide_ULEZ_expansion.gpkg' 
  using driver `GPKG'
Simple feature collection with 22 features and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 503898 ymin: 156666.1 xmax: 559663.9 ymax: 200881
Projected CRS: OSGB36 / British National Grid
Code
ggplot(data = london_ULEZ_sf) +
  geom_sf() +
  theme_minimal()

Code
# ggplot(london_sf) +
#   geom_sf() +
#   theme_minimal() +
#   labs(title = "London MSOAs")
Code
# Reproject the (commented-out) London MSOA layer to WGS84 for web mapping
# london_sf2 <- london_sf %>% st_transform(4326)

3.5.13 London Interactive Map

Code
library(leaflet)
library(leaflet.providers)

# Leaflet expects long-lat (WGS84) coordinates, so reproject the
# British National Grid polygons before mapping
london_ULEZ_wgs84 <- st_transform(london_ULEZ_sf, 4326)

# Create the leaflet map with the reprojected polygons
leaflet_map <- leaflet() %>%
  addProviderTiles(providers$CartoDB.DarkMatter) %>% # adding a basemap
  addPolygons(data = london_ULEZ_wgs84) # adding the polygons

leaflet_map

DRAFT - Conclusions

[Conclusions to be added]

Footnotes

  1. Tidy data format definition: link

  2. (2019 - ODI link)

  3. (link to appendix B/*)

  4. Tidy Data Principles: https://r4ds.hadley.nz/data-tidy.html

  5. Whyqd: https://whyqd.com/

  6. Data Services Investment Proposal (2021) - Appendix B

  7. Tidy Data: https://www.jstatsoft.org/article/view/v059i10

  8. Ordnance Survey: https://www.ordnancesurvey.co.uk/products/boundary-line

  9. ONS: https://geoportal.statistics.gov.uk/search?collection=Dataset&sort=name&tags=all(BDY_ADM)

  10. Census geographies: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies